Abstract



Walley's Imprecise Dirichlet Model (IDM) for categorical data overcomes 
several fundamental problems which other approaches to uncertainty suffer 
from. Yet, to be useful in practice, one needs efficient ways for computing 
the imprecise=robust sets or intervals. The main objective of this work is 
to derive exact, conservative, and approximate, robust and credible interval 
estimates under the IDM for a large class of statistical estimators, including 
the entropy and mutual information. 
1 Introduction



This work derives interval estimates under the Imprecise Dirichlet Model (IDM)
[Wal96] for a large class of statistical estimators. In the IDM one considers an
i.i.d. process with unknown chances^1 $\pi_i$ for outcome $i$. The prior
uncertainty about $\pi$^2 is modeled by a set of Dirichlet priors^3
$\{p(\pi) \propto \prod_i \pi_i^{s t_i - 1} : t \in \Delta\}$, where^4
$\Delta := \{t : t_i \geq 0, \sum_i t_i = 1\}$, and $s$ is a hyper-parameter,
typically chosen between 1 and 2. Sets of probability distributions are often
called Imprecise probabilities, hence the name IDM for this model. We avoid the
term imprecise and use robust instead, or capitalize Imprecise. IDM overcomes
several fundamental problems which other approaches to uncertainty suffer from
[Wal96]. For instance, IDM satisfies the representation invariance principle and
the symmetry principle, which are mutually exclusive in a pure Bayesian treatment
with proper prior [Wal96]. The counts $n_i$ for $i$ form a minimal sufficient
statistic of the data of size $n = \sum_i n_i$. Statistical estimators $F(n)$
usually also depend on the chosen prior, so a set of priors leads to a set of
estimators $\{F_t(n) : t \in \Delta\}$. For instance, the expected chances
$E_t[\pi_i] = \frac{n_i + s t_i}{n+s} =: u_i(t)$ lead to a robust interval
estimate $\underline{\overline{u}}_i = [\frac{n_i}{n+s}, \frac{n_i+s}{n+s}] \ni E_t[\pi_i]$.
Robust intervals for the variance $Var[\pi_i]$ [Wal96] and for the mean and
variance of linear combinations $\sum_i \alpha_i \pi_i$ have also been derived
[Ber01]. Bayesian estimators (like expectations) depend on $t$ and $n$ only
through $u$ (and $n+s$, which we suppress), i.e. $F_t(n) = F(u)$. The main
objective of this work is to derive approximate, conservative, and exact
intervals $[\min_{t\in\Delta} F(u), \max_{t\in\Delta} F(u)]$ for general $F(u)$,
and for the expected (also called predictive) entropy and the expected mutual
information in particular. These results are key building blocks for applying
IDM. Walley suggests, for instance, to use $\min_t P_t[\mathcal{T} \geq c] \geq \alpha$
for inference problems and $\min_t E_t[\mathcal{T}] \geq c$ for decision problems
[Wal96], where $\mathcal{T}$ is some function of $\pi$. One application is the
inference of robust tree-dependency structures [Zaf01, ZH03], in which edges are
partially ordered based on Imprecise mutual information.

Section 2 gives a brief introduction to IDM and describes our problem setup.
In Section 3 we derive exact robust intervals for concave functions F, such as the
entropy. Section 4 derives approximate robust intervals for arbitrary F. In Section 5
we show how bounds of elementary functions can be used to get bounds for composite
functions, especially for sums and products of functions. The results are used in
Section 6 for deriving robust intervals for the mutual information. The issue of
how to set up IDM models on product spaces is discussed in Section 7. Section 8



1 Also called objective or aleatory probabilities.

2 We denote vectors by $x := (x_1,...,x_d)$ for $x \in \{n, t, u, \pi, ...\}$.

3 Also called second order or subjective or belief or epistemic probabilities.

4 Strictly speaking, $\Delta$ should be the open simplex [Wal96], since $p(\pi)$ is
improper for $t$ on the boundary of $\Delta$. For simplicity we assume that, if
necessary, considered functions of $t$ can be and are continuously extended to the
boundary of $\Delta$, so that, for instance, minima and maxima exist. All
considerations can straightforwardly, but cumbersomely, be rewritten in terms of an
open simplex. Note that open/closed $\Delta$ result in open/closed robust intervals,
the difference being numerically/practically irrelevant.
addresses the problem of how to combine Bayesian credible intervals with the robust
intervals of the IDM. Conclusions are given in the final section.

2 The Imprecise Dirichlet Model 

Random i.i.d. processes. We consider discrete random variables $i \in \{1,...,d\}$ and
an i.i.d. random process with outcome $i \in \{1,...,d\}$ having probability $\pi_i$.
The chances $\pi$ form a probability distribution, i.e.
$\pi \in \Delta := \{x \in \mathbb{R}^d : x_i \geq 0 \ \forall i, \ x_+ = 1\}$, where
we have used the abbreviations $x = (x_1,...,x_d)$ and $x_+ := \sum_{i=1}^d x_i$. The
likelihood of a specific data set $D$ with $n_i$ observations $i$ and total sample size
$n = n_+ = \sum_i n_i$ is $p(D|\pi) = \prod_i \pi_i^{n_i}$. The chances $\pi_i$ are
usually unknown and have to be estimated from the sample frequencies $n_i$. The
frequency estimate $\frac{n_i}{n}$ for $\pi_i$ is one possible point estimate.

Second order p(oste)rior. In the Bayesian approach one models the initial
uncertainty in $\pi$ by a (second order) prior "belief" distribution $p(\pi)$ with
domain $\pi \in \Delta$. The Dirichlet priors $p(\pi) \propto \prod_i \pi_i^{n'_i - 1}$,
where $n'_i$ comprises prior information, represent a large class of priors. $n'_i$
may be interpreted as a (possibly fractional) virtual number of "observations". High
prior belief in $i$ can be modeled by large $n'_i$. It is convenient to write
$n'_i = s \cdot t_i$ with $s := n'_+$, hence $t \in \Delta$. Having no initial bias one
should choose a prior in which all $t_i$ are equal, i.e. $t_i = \frac{1}{d} \ \forall i$.
Examples for $s$ are 0 for Haldane's prior [Hal48], 1 for Perks' prior [Per47],
$\frac{d}{2}$ for Jeffreys' prior [Jef46], and $d$ for Bayes-Laplace's uniform prior
[GCSR95]. From the prior and the data likelihood one can determine the posterior
$p(\pi|D) = p(\pi|n) \propto \prod_i \pi_i^{n_i + s t_i - 1}$.

The posterior p(7r \ D) summarizes all statistical information available in the data. 
In general, the posterior is a very complex object, so we are interested in summaries 
of this plethora of information. A possible summary is the expected value or mean
$E_t[\pi_i] = \frac{n_i + s t_i}{n+s}$, which is often used for estimating $\pi_i$.
The accuracy may be obtained from the covariance of $\pi$.

Usually one is not only interested in an estimation of the whole vector $\pi$, but
also in an estimation of scalar functions $\mathcal{F} : \Delta \to \mathbb{R}$ of
$\pi$, such as the entropy $\mathcal{H}(\pi) = -\sum_i \pi_i \log \pi_i$, where log
denotes the natural logarithm. Since $\mathcal{F}$ is itself a random variable we
could determine the posterior distribution
$p(\mathcal{F}_0|n) = \int_\Delta \delta(\mathcal{F}(\pi) - \mathcal{F}_0)\, p(\pi|n)\, d\pi$
of $\mathcal{F}$, which may further be summarized by the posterior mean
$E_t[\mathcal{F}] = \int_\Delta \mathcal{F}(\pi)\, p(\pi|n)\, d\pi$ and possibly the
posterior variance $Var[\mathcal{F}]$. A simple, but crude approximation for the mean
can be obtained by exchanging $E$ with $\mathcal{F}$ (exact only for linear functions):
$E_t[\mathcal{F}(\pi)] \approx \mathcal{F}(E_t[\pi])$. The approximation error is
typically of the order $\frac{1}{n}$.
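The size of this approximation error can be checked numerically for the entropy. The following is a minimal sketch, not part of the original derivation: it assumes the closed form for the expected entropy that the paper states later as (5), restricted to integer $(n+s)u_i$, and the counts, $s$ and $t$ are hypothetical illustration values.

```python
import math

def expected_entropy_term(a, N):
    """h(u) at u = a/N for integer a, i.e. u * sum_{k=a+1}^{N} 1/k (form (5))."""
    return (a / N) * sum(1.0 / k for k in range(a + 1, N + 1))

def plug_in_entropy(u):
    """Entropy evaluated at the posterior mean u (the crude approximation)."""
    return -sum(x * math.log(x) for x in u if x > 0)

# Hypothetical data: counts n = (3, 6), s = 2, uniform t = (1/2, 1/2).
counts, s, t = (3, 6), 2, (0.5, 0.5)
N = sum(counts) + s                                   # n + s = 11
a = [round(c + s * ti) for c, ti in zip(counts, t)]   # (n+s) * u_i = 4, 7
u = [ai / N for ai in a]

exact = sum(expected_entropy_term(ai, N) for ai in a)  # E_t[entropy]
crude = plug_in_entropy(u)                             # entropy of E_t[pi]
gap = crude - exact                                    # positive, of order 1/n
print(exact, crude, gap)
```

For these values the gap is a few hundredths, in line with an error of order $\frac{1}{n}$.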

The Imprecise Dirichlet Model. The classical approach, which consists of selecting
a single prior, suffers from a number of problems. Firstly, when choosing for example
a uniform prior $t_i = \frac{1}{d}$, the prior depends on the particular choice of the
sampling space. Secondly, it assumes exact prior knowledge of $p(\pi)$. The solution
to the second problem is to model our ignorance by considering sets of priors
$p(\pi)$, often called Imprecise probabilities. The specific Imprecise Dirichlet
Model (IDM) [Wal96] considers the set of all $t \in \Delta$, i.e.
$\{p(\pi|n) : t \in \Delta\}$, which also solves the first problem. Walley suggests
fixing the hyperparameter $s$ somewhere in the interval $[1,2]$. A set of priors
results in a set of posteriors, set of expected values, etc. For real-valued
quantities like the expected entropy $E_t[\mathcal{H}]$ the sets are typically
intervals, which we call robust intervals

$E_t[\mathcal{F}] \in [\min_{t\in\Delta} E_t[\mathcal{F}], \ \max_{t\in\Delta} E_t[\mathcal{F}]]$.

Problem setup and notation. Consider any statistical estimator $F$. $F$ is a
function of the data $D$ and the hyperparameters $t$. We define the general
correspondence

$u_i^{\cdots} = \frac{n_i + s\, t_i^{\cdots}}{n+s}$, where $\cdots$ can be various superscripts.   (1)

$F$ can, hence, be rewritten as a function of $u$ and $D$. Since we regard $D$ as
fixed, we suppress this dependence and simply write $F = F(u)$. This is further
motivated by the fact that all Bayesian estimators of functions $\mathcal{F}$ of
$\pi$ only depend on $u$ and the sample size $n+s$. It is easy to see that this holds
for the mean, i.e. $E_t[\mathcal{F}] = F(u; n+s)$, and similarly for the variance and
all higher (central) moments. The main focus of this work is to derive exact and
approximate expressions for upper and lower $F$ values

$\overline{F} := \max_{t\in\Delta} F(u)$ and $\underline{F} := \min_{t\in\Delta} F(u)$, $\underline{\overline{F}} := [\underline{F}, \overline{F}]$,

where $t \in \Delta \Leftrightarrow u \in \Delta'$ with
$\Delta' := \{u : u_i \geq u_i^0, \ u_+ = 1\}$ and $u_i^0 := \frac{n_i}{n+s}$. We
define $u^{\overline F}$ as the $u \in \Delta'$ which maximizes $F$, i.e.
$\overline{F} = F(u^{\overline F})$, and similarly $t^{\overline F}$ through relation
(1). If the maximum of $F$ is assumed in a corner of $\Delta'$ we denote the index of
the corner by $i^{\overline F}$, i.e. $t_i^{\overline F} = \delta_{i i^{\overline F}}$,
where $\delta_{ij}$ is Kronecker's delta function. Similarly $u^{\underline F}$,
$t^{\underline F}$, $i^{\underline F}$.
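As a small illustration of correspondence (1), the sketch below maps a hyperparameter vector $t$ to $u$ and lists the robust intervals $[\frac{n_i}{n+s}, \frac{n_i+s}{n+s}]$ of the expected chances. The function names are hypothetical, and exact rational arithmetic is used only to keep the output readable.

```python
from fractions import Fraction

def posterior_mean(counts, t, s):
    """u_i = (n_i + s*t_i) / (n + s), relation (1)."""
    N = sum(counts) + s
    return [(Fraction(c) + s * Fraction(ti)) / N for c, ti in zip(counts, t)]

def chance_intervals(counts, s):
    """Robust interval of E_t[pi_i] over all t in the simplex."""
    N = sum(counts) + s
    return [(Fraction(c) / N, (Fraction(c) + s) / N) for c in counts]

# Counts n = (3, 6) with s = 1.
print(posterior_mean((3, 6), (Fraction(1, 2), Fraction(1, 2)), 1))
print(chance_intervals((3, 6), 1))
```

Each interval has width $\frac{s}{n+s}$, so it shrinks as the sample size grows.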



3 Exact Robust Intervals for Concave Estimators 

In this section we derive exact expressions for $\underline{\overline{F}}$ if
$F : \Delta \to \mathbb{R}$ is of the form

$F(u) = \sum_{i=1}^d f(u_i)$ and concave $f : [0,1] \to \mathbb{R}$.   (2)

The expected entropy is such an example (discussed later). Convex $f$ are treated
similarly (or simply take $-f$).

The nature of the solution. The approach to a solution of this problem is
motivated as follows: Due to symmetry and concavity of $F$, the global maximum is
attained at the center $u_i = \frac{1}{d}$ of the probability simplex $\Delta$, i.e.
the more uniform $u$ is, the larger $F(u)$. The nearer $u$ is to a vertex of
$\Delta$, i.e. the more unbalanced $u$ is, the smaller is $F(u)$. The constraints
$t_i \geq 0$ restrict $u$ to the smaller simplex

$\Delta' = \{u : u_i \geq u_i^0, \ u_+ = 1\}$ with $u_i^0 := \frac{n_i}{n+s}$,

which prevents setting $u_i^{\overline F} = \frac{1}{d}$ and
$u_i^{\underline F} = \delta_{i1}$. Nevertheless, the basic idea of choosing $u$ as
uniform / as unbalanced as possible still works, as we will see.

Greedy F(u) minimization. Consider the following procedure for obtaining
$u^{\underline F}$. We start with $t \equiv 0$ (outside the usual domain $\Delta$ of
$F$, which can be extended to $[0,1]^d$ via (2)) and then gradually increase $t$ in an
axis-parallel way until $t_+ = 1$. By axis-parallel we mean that only one component
of $t$ is increased at a time, and which one may change during the process. The total
zigzag curve from $t^{start} = 0$ to $t^{end}$ has length $t_+^{end} = 1$. Since all
possible curves have the same (Manhattan) length 1, $F(u^{end})$ is minimized for the
curve which has (on average) smallest $F$-gradient along its path. A greedy strategy
is to follow the direction $i$ of currently smallest $F$-gradient
$\frac{\partial F}{\partial t_i} = \frac{s}{n+s} f'(u_i)$. Since $f'$ is monotone
decreasing ($f'' < 0$), $\frac{\partial F}{\partial t_i}$ is smallest for largest
$u_i$. At $t^{start} = 0$, $u_i = \frac{n_i}{n+s}$ is largest for
$i = i^{min} := \arg\max_i n_i$. Once we start in direction $i^{min}$, $u_{i^{min}}$
increases even further whereas all other $u_i$ ($i \neq i^{min}$) remain constant. So
the moving direction is never changed and finally we reach a local minimum at
$t_i^{end} = \delta_{i i^{min}}$. In [Hut03] we show that this is a global minimum,
i.e.

$t_i^{\underline F} = \delta_{i i^{\underline F}}$ with $i^{\underline F} := \arg\max_i n_i$.   (3)

Greedy F(u) maximization. Similarly we maximize $F(u)$. Now we increase $t$ in
direction $i = i_1$ of maximal $\frac{\partial F}{\partial t_i}$, which is the
direction of smallest $u_i \propto n_i + s t_i$. Again, (only) $u_{i_1}$ increases,
but possibly reaches a value where it is no longer the smallest one. We stop if it
becomes equal to the second smallest $u_i$, say $i = i_2$. We now have to increase
$u_{i_1}$ and $u_{i_2}$ with the same speed (or in an $\varepsilon$-zigzag fashion)
until they become equal to $u_{i_3}$, etc., or until $u_+ = 1 = t_+$ is reached.
Assume the process stops with direction $i_m$ and minimal $u$ being $\tilde u$, i.e.
finally $u_{i_k} = \tilde u$ for $k \leq m$ and $t_{i_k} = 0$ for $k > m$. From the
constraint
$1 = u_+ = \sum_{k \leq m} u_{i_k} + \sum_{k > m} u_{i_k} = m \tilde u + \sum_{k > m} \frac{n_{i_k}}{n+s}$
we obtain
$\tilde u(m) = \frac{1}{m}[1 - \sum_{k > m} \frac{n_{i_k}}{n+s}] = [s + \sum_{k \leq m} n_{i_k}] / [m(n+s)]$.
One can show that $\tilde u(m)$ has one global minimum (no local ones) and that the
final $m$ is the one which minimizes $\tilde u$, i.e.

$\tilde u = \min_{m \in \{1...d\}} \frac{s + \sum_{k \leq m} n_{i_k}}{m(n+s)}$, where $n_{i_1} \leq n_{i_2} \leq ... \leq n_{i_d}$, $u_i^{\overline F} = \max\{u_i^0, \tilde u\}$.   (4)

If there is a unique minimal $n_{i_1}$ with gap $\geq s$ to the 2nd smallest
$n_{i_2}$ (which is quite likely for not too small $n$), then $m = 1$ and the maximum
is attained at a corner of $\Delta$ ($\Delta'$).
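The two greedy solutions (3) and (4) translate directly into a few lines of code. The following is a minimal sketch under the assumption of integer counts and integer $s$; the function names `u_lower` and `u_upper` are hypothetical.

```python
from fractions import Fraction

def u_lower(counts, s):
    """Minimizer of a concave sum F: all prior mass on the largest count, eq. (3)."""
    N = sum(counts) + s
    i_min = max(range(len(counts)), key=lambda i: counts[i])
    return [Fraction(c + (s if i == i_min else 0), N) for i, c in enumerate(counts)]

def u_upper(counts, s):
    """Maximizer of a concave sum F: lift the smallest components to a common level, eq. (4)."""
    N = sum(counts) + s
    partial, candidates = 0, []
    for m, c in enumerate(sorted(counts), start=1):
        partial += c                                   # sum of the m smallest counts
        candidates.append(Fraction(s + partial, m * N))
    level = min(candidates)                            # u-tilde
    return [max(Fraction(c, N), level) for c in counts]

print(u_lower([3, 6], 1))      # u = (3/10, 7/10)
print(u_upper([3, 6], 1))      # u = (4/10, 6/10)
print(u_upper([1, 2, 10], 2))  # two smallest components are equalized
```

Ties in the largest count are broken by taking the first index, which does not affect the value of $F$.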

Theorem 1 (Exact extrema for concave functions on simplices) Assume
$F : \Delta' \to \mathbb{R}$ is a concave function of the form
$F(u) = \sum_{i=1}^d f(u_i)$. Then $F$ attains the global maximum $\overline{F}$ at
$u^{\overline F}$ defined in (4) and the global minimum $\underline{F}$ at
$u^{\underline F}$ defined in (3).

Proof. What remains to be shown is that the solutions obtained in the last
paragraphs by greedy minimization/maximization of $F(u)$ are actually global
minima/maxima. For this assume that $t$ is a local minimum of $F(u)$. Let
$j := \arg\max_i u_i$ (ties broken arbitrarily). Assume that there is a $k \neq j$
with non-zero $t_k$. Define $t'$ as $t'_i = t_i$ for all $i \neq j,k$, and
$t'_j = t_j + \varepsilon$, $t'_k = t_k - \varepsilon$, for some
$0 < \varepsilon \leq t_k$. From $u_k \leq u_j$ and the concavity of $f$ we get^5

$F(t') - F(t) = [f(u'_j) + f(u'_k)] - [f(u_j) + f(u_k)]$
$= [f(u_j + \sigma\varepsilon) - f(u_j)] - [f(u_k) - f(u_k - \sigma\varepsilon)] < 0$

where $\sigma := \frac{s}{n+s}$. This contradicts the minimality assumption of $t$.
Hence, $t_i = 0$ for all $i$ except one (namely $j$, where it must be 1). (Local)
minima are attained in the vertices of $\Delta$. Obviously the global minimum is for
$t_i^{\underline F} = \delta_{i i^{\underline F}}$ with
$i^{\underline F} := \arg\max_i n_i$. This solution coincides with the greedy
solution. Note that the global minimum may not be unique, but since we are only
interested in the value of $F(u^{\underline F})$ and not its argument this degeneracy
is of no further significance.

Similarly for the maximum, assume that $t$ is a (local) maximum of $F(u)$. Let
$j := \arg\min_i u_i$ (ties broken arbitrarily). Assume that there is a $k \neq j$
with non-zero $t_k$ and $u_k > u_j$. Define $t'$ as above with
$0 < \varepsilon < \min\{t_k, t_k - t_j\}$. Concavity of $f$ implies

$F(t') - F(t) = [f(u_j + \sigma\varepsilon) - f(u_j)] - [f(u_k) - f(u_k - \sigma\varepsilon)] > 0$,

which contradicts the maximality assumption of $t$. Hence $t_i = 0$ if $u_i$ is not
minimal ($\tilde u$). The previous paragraph constructed the unique solution
$u^{\overline F}$ satisfying this condition. Since this is the only local maximum it
must be the unique global maximum (contrast this to the minimum case). □

Theorem 2 (Exact extrema of expected entropy) Let
$\mathcal{H}(\pi) = -\sum_i \pi_i \log \pi_i$ be the entropy of $\pi$ and the
uncertainty of $\pi$ be modeled by the Imprecise Dirichlet Model. The expected entropy
$H(u) := E_t[\mathcal{H}]$ for given hyperparameter $t$ and sample $n$ is given by

$H(u) = \sum_i h(u_i)$ with $h(u) = u \cdot [\psi(n+s+1) - \psi((n+s)u+1)] = u \cdot \sum_{k=(n+s)u+1}^{n+s} k^{-1}$   (5)

where $\psi(x) = d \log \Gamma(x) / dx$ is the logarithmic derivative of the Gamma
function and the last expression is valid for integral $s$ and $(n+s)u$. The lower
$\underline{H}$ and upper $\overline{H}$ expected entropies are assumed at
$u^{\underline H}$ and $u^{\overline H}$ given in (3) and (4) (with $F \leadsto H$,
see also (1)).

5 The slope $[f(u+\varepsilon) - f(u)]/\varepsilon$ is a decreasing function in $u$
for any $\varepsilon > 0$, since $f$ is concave.
A derivation of the exact expression (5) for the expected entropy can be found
in [WW95, Hut02]. The only thing to be shown is that $h$ is concave. This may be
done by exploiting special properties of the digamma function $\psi$ (see [AS74]).

There are fast implementations of $\psi$ and its derivatives and exact expressions
for integer and half-integer arguments.

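Example. For $d = 2$, $n_1 = 3$, $n_2 = 6$, $s = 1$ we have $n = 9$,
$u_1 = \frac{3+t_1}{10}$, $u_2 = \frac{6+t_2}{10}$, $t^0 = (0,0)$, $u^0 = (.3,.6)$,
see (1). From (3), $i^{\underline H} = 2$, $t^{\underline H} = (0,1)$,
$u^{\underline H} = (.3,.7)$. From (4), $i_1 = 1$, $i_2 = 2$,
$\tilde u = \min\{\frac{1+3}{1 \cdot 10}, \frac{1+3+6}{2 \cdot 10}\} = .4$,
$u^{\overline H} = \max\{u^0, \tilde u\} = (.4,.6)$ $\Rightarrow$
$t^{\overline H} = (1,0)$ is in a corner. From (5), $h(.3) = \frac{2761}{8400}$,
$h(.4) = \frac{2131}{6300}$, $h(.6) = \frac{1207}{4200}$, $h(.7) = \frac{847}{3600}$,
hence $\underline{\overline{H}} = [H(u^{\underline H}), H(u^{\overline H})] =
[h(.3) + h(.7), \ h(.4) + h(.6)] = [0.5639..., 0.6256...]$, so
$\overline{H} - \underline{H} = O(\frac{1}{10})$.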
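The numbers in this example can be reproduced with exact rational arithmetic. The sketch below assumes the finite-sum form of (5), which requires integral $s$ and $(n+s)u$; the extremal vectors are hard-coded from the example and the function names are hypothetical.

```python
from fractions import Fraction

def h(u, N):
    """h(u) = u * sum_{k=N*u+1}^{N} 1/k, valid when N*u is an integer (N = n + s)."""
    a = u * N
    assert a.denominator == 1, "finite-sum form needs integer (n+s)*u"
    return u * sum(Fraction(1, k) for k in range(int(a) + 1, N + 1))

def expected_entropy(u_vec, N):
    """H(u) = sum_i h(u_i), eq. (5)."""
    return sum(h(u, N) for u in u_vec)

N = 10                                         # n + s with n = 9, s = 1
u_low = [Fraction(3, 10), Fraction(7, 10)]     # minimizer from (3)
u_up = [Fraction(4, 10), Fraction(6, 10)]      # maximizer from (4)

H_low = expected_entropy(u_low, N)
H_up = expected_entropy(u_up, N)
print(float(H_low), float(H_up))               # robust interval of the expected entropy
```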

4 Approximate Robust Intervals 

In this section we derive approximations for $\underline{\overline{F}}$ suitable for
arbitrary, twice differentiable functions $F(u)$. The derived approximations for
$\underline{\overline{F}}$ will be robust in the sense of covering the set
$\underline{\overline{F}}$ (for any $n$), and the approximations will be "good" if
$n$ is not too small. In the following, we treat $\sigma := \frac{s}{n+s}$ as a
(small) expansion parameter. For $u, u^* \in \Delta'$ we have

$u_i - u_i^* = \sigma \cdot (t_i - t_i^*)$ and $|u_i - u_i^*| = \sigma |t_i - t_i^*| \leq \sigma$ with $\sigma := \frac{s}{n+s}$.   (6)

Hence we may Taylor-expand $F(u)$ around $u^*$, which leads to a Taylor series in
$\sigma$. This shows that $F$ is approximately linear in $u$ and hence in $t$. A
linear function on a simplex assumes its extreme values at the vertices of the
simplex. This has already been encountered in Section 3. The consideration above is a
simple explanation for this fact. This also shows that the robust interval
$\underline{\overline{F}}$ is of size $\overline{F} - \underline{F} = O(\sigma)$.^6
Any approximation to $\underline{\overline{F}}$ should hence be at least
$O(\sigma^2)$. The expansion of $F$ to $O(\sigma)$ is

$F(u) = F_0 + F_R$ with $F_0 := F(u^*) = O(1)$ and $F_R := \sum_i [\partial_i F(\check u)](u_i - u_i^*) = O(\sigma)$   (7)

where $\partial_i F(\check u)$ is the partial derivative
$\partial F(\check u) / \partial \check u_i$ of $F(\check u)$ w.r.t. $\check u_i$.
For suitable $\check u = \check u(u, u^*) \in \Delta'$ this expansion is exact
($F_R$ is the exact remainder). Natural points for expansion are
$t_i^* = \frac{1}{d}$ in the center of $\Delta$, or possibly also
$t_i^* = \frac{n_i}{n} = u_i^*$. See [Hut03] for such a general expansion. Here, we
expand around the improper point $t^* := t^0 \equiv 0$, which is outside(!) $\Delta$,
since this makes expressions particularly simple.^7 (7) is still

6 $f(n,t,s) = O(\sigma^k)$ $:\Leftrightarrow$ $\exists c \ \forall n \in \mathbb{N}_0^d, \ t \in \Delta, \ s > 0$ : $|f(n,t,s)| \leq c \sigma^k$, where $\sigma = \frac{s}{n+s}$.
7 The order of accuracy $O(\sigma^2)$ we will encounter is the same for all choices
of $u^*$. The concrete numerical errors differ of course. The choice $t^* = 0$ can
lead to $O(d)$ smaller $F_R$ than the natural center point $t^* = \frac{1}{d}$, but is
more likely a factor $O(1)$ larger. The exact numerical values depend on the
structure of $F$.
valid in this case, and $F_R$ is exact for some $\check u$ in

$\Delta'_e := \{u : u_i \geq u_i^0 \ \forall i, \ u_+ \leq 1\}$, where $u_i^0 = \frac{n_i}{n+s}$.

Note that we keep the exact condition $u \in \Delta'$. $F$ is usually already defined
on $\Delta'_e$ or extends from $\Delta'$ to $\Delta'_e$ without effort in a natural
way (analytical continuation). We introduce the notation

$F \sqsubseteq G$ $:\Leftrightarrow$ $F \leq G$ and $F = G + O(\sigma^2)$   (8)

stating that $G$ is a "good" upper bound on $F$. The following bounds hold for
arbitrary differentiable functions. In order for the bounds to be "good," $F$ has to
be Lipschitz differentiable in the sense that there exists a constant $c$ such that

$|\partial_i F(u)| \leq c$ and $|\partial_i F(u) - \partial_i F(u')| \leq c |u - u'|$ $\forall u, u' \in \Delta'_e$ and $\forall 1 \leq i \leq d$.   (9)

If $F$ depends also on $n$, e.g. via $\sigma$ or $u^0$, then $c$ shall be independent
of them.

The Lipschitz condition is satisfied, for instance, if the curvature $\partial^2 F$
is uniformly bounded. This is satisfied for the expected entropy $H$ (see (5)), but
violated for the approximation $E_t[\mathcal{H}] \approx \mathcal{H}(u)$ if
$n_i = 0$ for some $i$.

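Theorem 3 (Approximate robust intervals) Assume $F : \Delta'_e \to \mathbb{R}$ is a
Lipschitz differentiable function (9). Let $[\underline{F}, \overline{F}]$ be the
global [minimum, maximum] of $F$ restricted to $\Delta'$. Then

$F(u^1) \sqsubseteq \overline{F} \sqsubseteq F_0 + F_R^{ub}$ where $F_R^{ub} = \max_i F_{iR}^{ub}$ and $F_{iR}^{ub} = \sigma \max_{u \in \Delta'_e} [\partial_i F(u)]$,

$F_0 + F_R^{lb} \sqsubseteq \underline{F} \sqsubseteq F(u^2)$ where $F_R^{lb} = \min_i F_{iR}^{lb}$ and $F_{iR}^{lb} = \sigma \min_{u \in \Delta'_e} [\partial_i F(u)]$,

$F_0 = F(u^0)$, and $t_i^1 = \delta_{i i^1}$ with $i^1 = \arg\max_i F_{iR}^{ub}$, and
$t_i^2 = \delta_{i i^2}$ with $i^2 = \arg\min_i F_{iR}^{lb}$ ($u^1$, $u^2$ follow from
$t^1$, $t^2$ via (1)), and $\sqsubseteq$ defined in (8) means $\leq$ and
$= +O(\sigma^2)$, where $\sigma = 1 - u_+^0$.

For conservative estimates, the lower bound on $\underline{F}$ and the upper bound on
$\overline{F}$ are the interesting ones.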

Proof. We start by giving an $O(\sigma^2)$ bound on
$\overline{F_R} = \max_{u \in \Delta'} F_R(u)$. We first insert (6) with
$t^* = t^0 \equiv 0$ into (7) and treat $\check u$ and $t$ as separate variables:

$F_R(\check u, t) = \sigma \sum_i [\partial_i F(\check u)] \cdot t_i \sqsubseteq \max_{\check u \in \Delta'_e} \sigma \sum_i [\partial_i F(\check u)] \cdot t_i \sqsubseteq \sum_i F_{iR}^{ub} \cdot t_i$   (10)

The first inequality is obvious, the second follows from the convexity of max. From
assumption (9) we get $\partial_i F(u) - \partial_i F(u') = O(\sigma)$ for all
$u, u' \in \Delta'_e$, since $\Delta'_e$ has diameter $O(\sigma)$. Due to one
additional $\sigma$ in (10) the expressions in (10) change only by $O(\sigma^2)$ when
introducing or dropping $\max_{\check u}$ anywhere. This shows that the inequalities
are tight within $O(\sigma^2)$ and justifies $\sqsubseteq$. We now upper bound
$F_R(u)$:

$\overline{F_R} = \max_{u \in \Delta'} F_R(u) \sqsubseteq \max_{\check u \in \Delta'_e} \max_{t \in \Delta} F_R(\check u, t) \sqsubseteq \max_{t \in \Delta} \sum_i F_{iR}^{ub} \cdot t_i = \max_i F_{iR}^{ub} =: F_R^{ub}$   (11)

A linear function on $\Delta$ is maximized by setting the $t_i$ component with largest
coefficient to 1. This shows the last equality. The maximization over $\check u$ in
(10) can often be performed analytically, leaving an easy $O(d)$ time task for
maximizing over $i$.

We have derived an upper bound $F_R^{ub}$ on $\overline{F_R}$. Let us define the
corner $t_i^1 = \delta_{i i^1}$ of $\Delta$ with $i^1 := \arg\max_i F_{iR}^{ub}$.
Since $\overline{F_R} \geq F_R(u)$ for all $u$, $F_R(u^1)$ in particular is a lower
bound on $\overline{F_R}$. A similar line of reasoning as above shows that
$F_R(u^1) = \overline{F_R} + O(\sigma^2)$. Using
$\overline{F + const.} = \overline{F} + const.$ we get $O(\sigma^2)$ lower and upper
bounds on $\overline{F}$, i.e.
$F(u^1) \sqsubseteq \overline{F} \sqsubseteq F_0 + F_R^{ub}$. $\underline{F}$ is
bounded similarly with all max's replaced by min's and inequalities reversed.
Together this proves Theorem 3. □



5 Error Propagation 

Approximation of $\underline{\overline{F}}$ (special cases). For the special case
$F(u) = \sum_i f(u_i)$ we have $\partial_i F(u) = f'(u_i)$. For concave $f$, as in the
case of the entropy, we get particularly simple bounds

$F_{iR}^{ub} = \sigma \max_{u \in \Delta'_e} f'(u_i) = \sigma f'(u_i^0)$, $F_R^{ub} = \sigma \max_i f'(u_i^0) = \sigma f'(\frac{\min_i n_i}{n+s})$,

$F_{iR}^{lb} = \sigma \min_{u \in \Delta'_e} f'(u_i) = \sigma f'(u_i^0 + \sigma)$, $F_R^{lb} = \sigma \min_i f'(u_i^0 + \sigma) = \sigma f'(\frac{\max_i n_i + s}{n+s})$,

where we have used
$\max_{u \in \Delta'_e} f'(u_i) = \max_{u_i \in [u_i^0, u_i^0 + \sigma]} f'(u_i) = f'(u_i^0)$,
and similarly for min. Analogous results hold for convex functions. In case the
maximum cannot be found exactly one is allowed to further increase $\Delta'_e$ as long
as its diameter remains $O(\sigma)$. Often an increase to
$\Box' := \{u : u_i^0 \leq u_i \leq u_i^0 + \sigma\} \supset \Delta'_e \supset \Delta'$
makes the problem easy. Note that if we were to perform this kind of crude
enlargement on $\max_u F(u)$ directly we would lose the bounds by $O(\sigma)$.

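Example (continued). $\sigma = \frac{1}{10}$,
$h'(.3) = \frac{13051}{2520} - \frac{1}{2}\pi^2$,
$h'(.7) = \frac{91717}{8400} - \frac{7}{6}\pi^2$,
$H_0 = H(u^0) = h(.3) + h(.6)$, $H_R^{ub} = \frac{1}{10} h'(.3)$,
$H_R^{lb} = \frac{1}{10} h'(.7)$ $\Rightarrow$
$[H_0 + H_R^{lb}, \ H_0 + H_R^{ub}] = [0.5564..., 0.6404...]$, hence
$H_0 + H_R^{ub} - \overline{H} = 0.0148... = O(\frac{1}{10^2})$,
$\underline{H} - H_0 - H_R^{lb} = 0.0074... = O(\frac{1}{10^2})$.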
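The approximate interval of this example can be recomputed with the standard library alone. The sketch below is an illustration under stated assumptions: $s$ and the counts are integers, so that $h$ and $h'$ are only needed at arguments $u = a/(n+s)$ with integer $a$, where $\psi$ reduces to harmonic sums and $\psi'(m) = \frac{\pi^2}{6} - \sum_{k<m} k^{-2}$. All function names are hypothetical.

```python
import math

def harmonic_tail(a, N):
    """psi(N+1) - psi(a+1) = sum_{k=a+1}^{N} 1/k for integers a <= N."""
    return sum(1.0 / k for k in range(a + 1, N + 1))

def trigamma_int(m):
    """psi'(m) for integer m >= 1."""
    return math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, m))

def h(a, N):
    """h(u) of (5) at u = a/N."""
    return (a / N) * harmonic_tail(a, N)

def dh(a, N):
    """h'(u) = psi(N+1) - psi(N*u+1) - N*u*psi'(N*u+1) at u = a/N."""
    return harmonic_tail(a, N) - a * trigamma_int(a + 1)

def approx_entropy_interval(counts, s):
    """[H_0 + H_R^lb, H_0 + H_R^ub] for the concave sum-form case."""
    N = sum(counts) + s
    sigma = s / N
    H0 = sum(h(c, N) for c in counts)               # H(u^0)
    upper = H0 + sigma * dh(min(counts), N)         # sigma * h'(min_i n_i / (n+s))
    lower = H0 + sigma * dh(max(counts) + s, N)     # sigma * h'((max_i n_i + s)/(n+s))
    return lower, upper

lower, upper = approx_entropy_interval([3, 6], 1)
print(lower, upper)
```

The resulting interval encloses the exact robust interval $[0.5639..., 0.6256...]$ of the example, as a conservative estimate should.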

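Error propagation. Assume we found bounds for estimators $G(u)$ and $H(u)$
and we now want to bound the sum $F(u) := G(u) + H(u)$. In the direct approach
$\overline{F} \leq \overline{G} + \overline{H}$ we may lose $O(\sigma)$. A simple
example is $G(u) = u_i$ and $H(u) = -u_i$ for which $F(u) = 0$, hence
$0 = \overline{F} \leq \overline{G} + \overline{H} = u_i^0 + \sigma - u_i^0 = \sigma$,
i.e. $\overline{F} \not\sqsubseteq \overline{G} + \overline{H}$. We can exploit the
techniques of the previous section to obtain $O(\sigma^2)$ approximations.

$F_{iR}^{ub} = \sigma \max_{u \in \Delta'_e} \partial_i F(u) \sqsubseteq \sigma \max_{u \in \Delta'_e} \partial_i G(u) + \sigma \max_{u \in \Delta'_e} \partial_i H(u) = G_{iR}^{ub} + H_{iR}^{ub}$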

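Theorem 4 (Error propagation: Sum) Let $G(u)$ and $H(u)$ be Lipschitz
differentiable and $F(u) = \alpha G(u) + \beta H(u)$, $\alpha, \beta \geq 0$, then
$\overline{F} \sqsubseteq F_0 + F_R^{ub}$ and
$\underline{F} \sqsupseteq F_0 + F_R^{lb}$, where $F_0 = \alpha G_0 + \beta H_0$, and
$F_{iR}^{ub} \sqsubseteq \alpha G_{iR}^{ub} + \beta H_{iR}^{ub}$, and
$F_{iR}^{lb} \sqsupseteq \alpha G_{iR}^{lb} + \beta H_{iR}^{lb}$.

It is important to notice that $F_R^{ub} \not\sqsubseteq G_R^{ub} + H_R^{ub}$ (use
the previous example), i.e.
$\max_i [G_{iR}^{ub} + H_{iR}^{ub}] \not\sqsubseteq \max_i G_{iR}^{ub} + \max_i H_{iR}^{ub}$.
$\max_i$ cannot be pulled in, and it is important to propagate $F_{iR}^{ub}$, rather
than $F_R^{ub}$.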
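The difference between propagating the per-component bounds and propagating the already maximized bounds can be seen on the example $G(u) = u_1$, $H(u) = -u_1$ in $d = 2$ dimensions. The following sketch uses hypothetical helper names and an arbitrary value $\sigma = 0.1$.

```python
def propagated_sum_bound(G_iR, H_iR, alpha=1.0, beta=1.0):
    """max_i [alpha*G_iR + beta*H_iR]: combine per component first, then maximize."""
    return max(alpha * g + beta * h for g, h in zip(G_iR, H_iR))

def naive_sum_bound(G_iR, H_iR):
    """max_i G_iR + max_i H_iR: maximize first, then add (may lose O(sigma))."""
    return max(G_iR) + max(H_iR)

sigma = 0.1
G_iR = [sigma, 0.0]     # G(u) = u_1:  d_1 G = 1,  d_2 G = 0
H_iR = [-sigma, 0.0]    # H(u) = -u_1: d_1 H = -1, d_2 H = 0

good = propagated_sum_bound(G_iR, H_iR)   # exact, since F = G + H = 0
crude = naive_sum_bound(G_iR, H_iR)       # off by sigma
print(good, crude)
```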

Every function F with bounded curvature can be written as a sum of a concave 
function G and a convex function H . For convex and concave functions, determining 
bounds is particularly easy, as we have seen. Often F decomposes naturally into 
convex and concave parts as is the case for the mutual information, addressed later. 
Bounds can also be derived for products. 

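Theorem 5 (Error propagation: Product) Let $G, H : \Delta'_e \to [0, \infty)$ be
non-negative Lipschitz differentiable functions (9) with non-negative derivatives
$\partial_i G, \partial_i H \geq 0$ $\forall i$ and $F(u) = G(u) \cdot H(u)$, then
$\overline{F} \sqsubseteq F_0 + F_R^{ub}$, where $F_0 = G_0 \cdot H_0$, and
$F_{iR}^{ub} \sqsubseteq G_{iR}^{ub}(H_0 + H_R^{ub}) + (G_0 + G_R^{ub}) H_{iR}^{ub}$,
and similarly for $\underline{F}$.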

Proof. We have

$F_{iR}^{ub} = \sigma \max \partial_i F = \sigma \max \partial_i (G \cdot H) = \sigma \max [(\partial_i G) H + G (\partial_i H)] \sqsubseteq$

$\sigma (\max \partial_i G)(\max H) + \sigma (\max G)(\max \partial_i H) \sqsubseteq G_{iR}^{ub}(H_0 + H_R^{ub}) + (G_0 + G_R^{ub}) H_{iR}^{ub}$

where all functions depend on $u$ and all max are over $u \in \Delta'_e$. There is
one subtlety in the last inequality:
$\max G \neq \overline{G} \sqsubseteq G_0 + G_R^{ub}$. The reason for the $\neq$ is
that the maximization is taken over $\Delta'_e$, not over $\Delta'$ as in the
definition of $\overline{G}$. The correct line of reasoning is as follows:

$\max_{u \in \Delta'_e} G_R(u) \sqsubseteq \max_{t \in \Delta_e} \sum_i G_{iR}^{ub} \cdot t_i = \max\{0, \max_i G_{iR}^{ub}\} = G_R^{ub}$ $\Rightarrow$ $\max G \sqsubseteq G_0 + G_R^{ub}$

The first inequality can be proven in the same way as (11). In the first equality we
set the $t_i = 1$ with maximal $G_{iR}^{ub}$ if it is positive. If all $G_{iR}^{ub}$
are negative we set $t \equiv 0$. We assumed $G \geq 0$ and $\partial_i G \geq 0$,
which implies $G_R \geq 0$. So, since $G_R \geq 0$ anyway, this subtlety is
ineffective. Similarly for $\max H_R$. □

It is possible to remove the rather strong non-negativity assumptions. Propaga- 
tion of errors for other combinations like ratios F = G/H may also be obtained. 
6 Robust Intervals for Mutual Information



Mutual Information. We illustrate the application of the previous results on the
Mutual Information between two random variables $i \in \{1,...,d_1\}$ and
$j \in \{1,...,d_2\}$. Consider an i.i.d. random process with outcome
$(i,j) \in \{1,...,d_1\} \times \{1,...,d_2\}$ having joint probability $\pi_{ij}$,
where $\pi \in \Delta := \{x \in \mathbb{R}^{d_1 \times d_2} : x_{ij} \geq 0 \ \forall ij, \ x_{++} = 1\}$.
An important measure of the stochastic dependence of $i$ and $j$ is the mutual
information

$\mathcal{I}(\pi) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+} \pi_{+j}} = \sum_{ij} \pi_{ij} \log \pi_{ij} - \sum_i \pi_{i+} \log \pi_{i+} - \sum_j \pi_{+j} \log \pi_{+j}$   (12)

$\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$ are row and column
marginal chances. Again, we assume a Dirichlet prior over $\pi$, which leads to a
Dirichlet posterior $p(\pi|n) \propto \prod_{ij} \pi_{ij}^{n_{ij} + s t_{ij} - 1}$
with $t \in \Delta$. The expected value of $\pi_{ij}$ is

$E_t[\pi_{ij}] = \frac{n_{ij} + s t_{ij}}{n+s} =: u_{ij}$

The marginals $\pi_{i+}$ and $\pi_{+j}$ are also Dirichlet with expectation $u_{i+}$
and $u_{+j}$. The expected mutual information $I(u) := E_t[\mathcal{I}]$ can, hence,
be expressed in terms of the expectations of three entropies
$H(u) := E_t[\mathcal{H}]$ (see (5))

$I(u) = H(u_{i+}) + H(u_{+j}) - H(u_{ij}) = H_{row} + H_{col} - H_{joint} = \sum_i h(u_{i+}) + \sum_j h(u_{+j}) - \sum_{ij} h(u_{ij})$

where here and in the following we index quantities with joint, row, and col to
denote to which distribution the quantity refers.

Crude bounds for I(u). Estimates for the robust IDM interval
$[\min_{t \in \Delta} E_t[\mathcal{I}], \max_{t \in \Delta} E_t[\mathcal{I}]]$ can be
obtained by [minimizing, maximizing] $I(u)$. A crude upper bound can be obtained as

$\overline{I} := \max_{t \in \Delta} I(u) = \max [H_{row} + H_{col} - H_{joint}] \leq$

$\max H_{row} + \max H_{col} - \min H_{joint} = \overline{H}_{row} + \overline{H}_{col} - \underline{H}_{joint}$,

where exact solutions to $\overline{H}_{row}$, $\overline{H}_{col}$ and
$\underline{H}_{joint}$ are available from Section 3. Similarly
$\underline{I} \geq \underline{H}_{row} + \underline{H}_{col} - \overline{H}_{joint}$.
The problem with these bounds is that, although good in some cases, they can become
arbitrarily crude. The following $O(\sigma^2)$ bound can be derived by exploiting the
error sum propagation Theorem 4.
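Theorem 6 (Bound on lower and upper Mutual Information) The following
bounds on the expected mutual information $I(u) = E_t[\mathcal{I}]$ are valid:

$I(u^1) \sqsubseteq \overline{I} \sqsubseteq I_0 + I_R^{ub}$ and $I_0 + I_R^{lb} \sqsubseteq \underline{I} \sqsubseteq I(u^2)$, where
$I_0 = I(u^0) = H_{0row} + H_{0col} - H_{0joint} = \sum_i h(u_{i+}^0) + \sum_j h(u_{+j}^0) - \sum_{ij} h(u_{ij}^0)$,
$I_{ijR}^{ub} \sqsubseteq H_{iRrow}^{ub} + H_{jRcol}^{ub} - H_{ijRjoint}^{lb} = \sigma [h'(u_{i+}^0) + h'(u_{+j}^0) - h'(u_{ij}^0 + \sigma)]$,
$I_{ijR}^{lb} \sqsupseteq H_{iRrow}^{lb} + H_{jRcol}^{lb} - H_{ijRjoint}^{ub} = \sigma [h'(u_{i+}^0 + \sigma) + h'(u_{+j}^0 + \sigma) - h'(u_{ij}^0)]$,

with $I_R^{ub} = \max_{ij} I_{ijR}^{ub}$, $I_R^{lb} = \min_{ij} I_{ijR}^{lb}$, $h$
defined in (5), the factor $\sigma$ as in the special-case bounds of Section 5, and
$t_{ij}^0 = 0$, and $t_{ij}^1 = \delta_{(ij)(ij)^1}$ with
$(ij)^1 = \arg\max_{ij} I_{ijR}^{ub}$, and $t_{ij}^2 = \delta_{(ij)(ij)^2}$ with
$(ij)^2 = \arg\min_{ij} I_{ijR}^{lb}$.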
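The bounds of Theorem 6 can be evaluated in $O(d_1 d_2)$ time. The following is a minimal sketch, assuming integer counts and integer $s$ so that $h$ and $h'$ are only needed at arguments $a/(n+s)$ with integer $a$; the contingency table and all function names are hypothetical, and the factor $\sigma$ is taken from the special-case bounds of Section 5.

```python
import math

def harmonic_tail(a, N):
    """psi(N+1) - psi(a+1) for integers a <= N."""
    return sum(1.0 / k for k in range(a + 1, N + 1))

def trigamma_int(m):
    """psi'(m) for integer m >= 1."""
    return math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, m))

def h(a, N):
    """h(u) of (5) at u = a/N."""
    return (a / N) * harmonic_tail(a, N)

def dh(a, N):
    """h'(u) at u = a/N."""
    return harmonic_tail(a, N) - a * trigamma_int(a + 1)

def expected_mi(table, N):
    """I(u) = H_row + H_col - H_joint for a table of integers a_ij = N * u_ij."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return (sum(h(a, N) for a in rows) + sum(h(a, N) for a in cols)
            - sum(h(a, N) for r in table for a in r))

def mi_bounds(counts, s):
    """Return ((outer lower, inner lower), (inner upper, outer upper))."""
    N = sum(map(sum, counts)) + s
    sigma = s / N
    rows = [sum(r) for r in counts]
    cols = [sum(c) for c in zip(*counts)]
    I0 = expected_mi(counts, N)                     # I(u^0), i.e. t = 0
    ub, lb = {}, {}
    for i, r in enumerate(counts):
        for j, nij in enumerate(r):
            ub[i, j] = sigma * (dh(rows[i], N) + dh(cols[j], N) - dh(nij + s, N))
            lb[i, j] = sigma * (dh(rows[i] + s, N) + dh(cols[j] + s, N) - dh(nij, N))
    cell_hi = max(ub, key=ub.get)                   # (ij)^1
    cell_lo = min(lb, key=lb.get)                   # (ij)^2

    def at_vertex(cell):
        bumped = [list(r) for r in counts]
        bumped[cell[0]][cell[1]] += s               # t = vertex of the simplex
        return expected_mi(bumped, N)

    return (I0 + lb[cell_lo], at_vertex(cell_lo)), (at_vertex(cell_hi), I0 + ub[cell_hi])

(lo_outer, lo_inner), (hi_inner, hi_outer) = mi_bounds([[10, 2], [3, 9]], 1)
print(lo_outer, lo_inner, hi_inner, hi_outer)
```

The outer values are the conservative bounds; the inner values are attained at vertices and show how tight the outer bounds are for this table.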

7 IDM for Product Spaces 

Product spaces $\Omega = \Omega_1 \times ... \times \Omega_m$ with
$\Omega_k = \{1,...,d_k\}$ occur frequently in practical problems, e.g. in the mutual
information ($m = 2$), in robust trees ($m = 3$), or in Bayesian nets in general
($m$ large). Without loss of generality we only discuss the $m = 2$ case in the
following. Ignoring the underlying structure in $\Omega$, a Dirichlet prior in case of
unknown chances $\pi_{ij}$ and an IDM as used in Section 6 with

$t \in \Delta := \{t \in \mathbb{R}^{d_1 \times d_2} = \mathbb{R}^{d_1} \otimes \mathbb{R}^{d_2} : t_{ij} \geq 0 \ \forall ij, \ t_{++} = 1\}$   (13)

seems natural. On the other hand, if we take into account the structure of $\Omega$
and go back to the original motivation of IDM this choice is far less obvious. Recall
that one of the major motivations of IDM was its reparametrization invariance in the
sense that inferences are not affected when grouping or splitting events in
$\Omega$. For unstructured spaces like $\Omega_k$ this is a reasonable principle. For
illustration, let us consider objects of various shape and color, i.e.
$\Omega = \Omega_1 \times \Omega_2$, $\Omega_1 = \{ball, pen, die, ...\}$,
$\Omega_2 = \{yellow, red, green, ...\}$ in generalization to Walley's bag of marbles
example. Assume we want to detect a potential dependency between shape and color by
means of their mutual information $I$. If we have no prior idea on the possible kind
of colors, a model which is independent of the choice of $\Omega_2$ is welcome.
Grouping red and green, for instance, corresponds to
$(x_{i1}, x_{i2}, x_{i3}, x_{i4}, ...) \leadsto (x_{i1}, x_{i2} + x_{i3}, x_{i4}, ...)$
for all shapes $i$, where $x \in \{n, \pi, t, u\}$. Similarly for the different
shapes, for instance we could group all round or all angular objects. The "smallest
IDM" which respects this invariance is the one which considers all

$t \in \Delta^\otimes := \Delta_{d_1} \otimes \Delta_{d_2} \subsetneq \Delta$.   (14)

The tensor or outer product $\otimes$ is defined as $(v \otimes w)_{ij} := v_i w_j$
and $V \otimes W := \{v \otimes w : v \in V, w \in W\}$. It is a bilinear (not
linear!) mapping. This "small tensor" $\Delta^\otimes$ IDM is invariant under
arbitrary grouping of columns and rows of the chance matrix
$(\pi_{ij})_{1 \leq i \leq d_1, 1 \leq j \leq d_2}$. In contrast to the larger
$\Delta$ IDM model it is not invariant under arbitrary grouping of matrix cells, but
there is anyway little motivation for the necessity of such a general invariance.
General non-column/row cross groupings would destroy the product structure of
$\Omega$ and with that the mere concepts of shape and color, and their correlation.
For $m > 2$ as in Bayes-nets cross groupings look even less natural. Whether
$\Delta^\otimes$ or the larger simplex $\Delta$ is the more appropriate IDM model
depends on whether one regards the structure $\Omega_1 \times \Omega_2$ of $\Omega$
as natural prior knowledge or as an arbitrary a posteriori choice. The smaller IDM has
the potential advantage of leading to more precise predictions (smaller robust sets).
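The invariance of the tensor-product set under grouping of columns, and the fact that it contains the vertices of the full simplex but not all of its points, can be checked on small examples. The sketch below uses hypothetical helper names and exact rational arithmetic.

```python
from fractions import Fraction as Fr

def outer(v, w):
    """(v (x) w)_ij = v_i * w_j."""
    return [[vi * wj for wj in w] for vi in v]

def group_columns(matrix, j, k):
    """Merge column k into column j (j < k) by adding the entries row by row."""
    return [[x + row[k] if c == j else x for c, x in enumerate(row) if c != k]
            for row in matrix]

def is_product(matrix):
    """True if all 2x2 minors vanish, i.e. the matrix is an outer product."""
    rows, cols = len(matrix), len(matrix[0])
    return all(matrix[i][j] * matrix[k][l] == matrix[i][l] * matrix[k][j]
               for i in range(rows) for k in range(i + 1, rows)
               for j in range(cols) for l in range(j + 1, cols))

v = [Fr(1, 3), Fr(2, 3)]                  # shape hyperparameters
w = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]        # color hyperparameters
t = outer(v, w)

# Grouping the last two colors keeps t inside the tensor-product set.
grouped = group_columns(t, 1, 2)
print(grouped == outer(v, [Fr(1, 2), Fr(1, 2)]))

vertex = outer([1, 0], [0, 0, 1])         # a vertex of the full simplex
diagonal = [[Fr(1, 2), 0], [0, Fr(1, 2)]] # in the full simplex, not a product
print(is_product(vertex), is_product(diagonal))
```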

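Let us consider an estimator $F : \Delta \to \mathbb{R}$ and its restriction
$F^\otimes : \Delta^\otimes \to \mathbb{R}$ to the tensor-product set
$\Delta^\otimes$ of (14). Robust intervals $[\underline{F}, \overline{F}]$ for
$\Delta$ are generally wider than robust intervals
$[\underline{F^\otimes}, \overline{F^\otimes}]$ for $\Delta^\otimes$. Fortunately not
much. Although $\Delta^\otimes$ is a lower-dimensional subspace of $\Delta$, it
contains all vertices of $\Delta$. This is possible since $\Delta^\otimes$ is a
nonlinear subspace. The set of "vertices" in both cases is
$\{t : t_{ij} = \delta_{i i_0} \delta_{j j_0}, \ i_0 \in \Omega_1, \ j_0 \in \Omega_2\}$.
Hence, if the robust interval boundaries $\underline{\overline{F}}$ are assumed in the
vertices of $\Delta$ then the interval for the $\Delta^\otimes$ IDM model is the same
($\underline{\overline{F}} = \underline{\overline{F^\otimes}}$). Since the condition
is "approximately" true, the conclusion is "approximately" true. More precisely:

Theorem 7 (IDM bounds for product spaces) The $O(\sigma^2)$ bounds of Theorem 3
on the robust interval $\underline{\overline{F}}$ in the full IDM model $\Delta$ (13)
remain valid for $\underline{\overline{F^\otimes}}$ in the product IDM model
$\Delta^\otimes$ (14).

Proof. $F(u^1) \leq \overline{F^\otimes} \leq \overline{F} \sqsubseteq F_0 + F_R^{ub} = F(u^1) + O(\sigma^2)$,

where $\overline{F^\otimes} := \max_{t \in \Delta^\otimes} F(u)$ and $u^1$ was the
"$F_R$ maximizing" vertex as defined in Theorem 3
($F(u^1) \sqsubseteq \overline{F}$). The first inequality follows from the fact that
all $\Delta$ vertices also belong to $\Delta^\otimes$, i.e. $t^1 \in \Delta^\otimes$.
The second inequality follows from $\Delta^\otimes \subset \Delta$. The remaining
(in)equalities follow from Theorem 3. This shows that
$|\overline{F^\otimes} - \overline{F}| = O(\sigma^2)$, hence $F_0 + F_R^{ub}$ is also
an $O(\sigma^2)$ upper bound to $\overline{F^\otimes}$. This implies that, to the
approximation accuracy we can achieve, the choice between $\Delta$ and
$\Delta^\otimes$ is irrelevant. □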

8 Robust Credible Intervals 

Bayesian credible sets/intervals. For a probability distribution
$p : \mathbb{R}^d \to [0,1]$, an $\alpha$-credible region is a measurable set $A$ for
which $p(A) := \int p(x) \chi_A(x) d^d x \geq \alpha$, where $\chi_A(x) = 1$ if
$x \in A$ and 0 otherwise, i.e. $x \in A$ with probability at least $\alpha$. For
given $\alpha$, there are many choices for $A$. Often one is interested in "small"
sets, where the size of $A$ may be measured by its volume
$Vol(A) := \int \chi_A(x) d^d x$. Let us define a/the smallest $\alpha$-credible set

$A^{min} := \arg\min_{A : p(A) \geq \alpha} Vol(A)$

with ties broken arbitrarily. For unimodal $p$, $A^{min}$ can be chosen as a connected
set. For $d = 1$ this means that $A^{min} = [a,b]$ with
$\int_a^b p(x) dx = \alpha$ is a minimal length $\alpha$-credible interval. If,
additionally, $p$ is symmetric around $E[x]$, then
$A^{min} = [E[x] - a, E[x] + a]$ is also symmetric around $E[x]$.
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Robust credible sets. If we have a set of probability distributions {pt{x), £GT}, 
we can choose for each t an a-credible set A t with pt{A t ) >a, a minimal one being 
A™ m :—axgmin A . pt ( A \ >a Vol(A). A robust a-credible set is a set A which contains x 
with pt-probability at least a for all t. A minimal size robust a-credible set is 

$$A^{\min} := \arg\min_{A=\bigcup_t A_t:\; p_t(A_t)\ge\alpha\ \forall t\in T} \mathrm{Vol}(A) \tag{15}$$

It is not easy to deal with this expression, since $A^{\min}$ is not a function of $\{A_t^{\min} : t \in T\}$,
and especially does not coincide with $\bigcup_t A_t^{\min}$ as one might expect.

Robust credible intervals. This can most easily be seen for univariate symmetric
unimodal distributions, where $t$ is a translation, e.g. $p_t(x) = \mathrm{Normal}(E[x]=t, \sigma=1)$
with 95% credible intervals $A_t^{\min} = [t-2, t+2]$. For, e.g., $T = [-1,1]$ we get $\bigcup_t A_t^{\min} =
[-3,3]$. The credible intervals move with $t$. One can get a smaller union if we take the
intervals $A_t^{sym} = [-a_t, a_t]$ symmetric around 0. Since $A_t^{sym}$ is a non-central interval
w.r.t. $p_t$ for $t \ne 0$, we have $a_t > 2$, i.e. $A_t^{sym}$ is larger than $A_t^{\min}$, but one can show that
the increase of $a_t$ is smaller than the shift of $A_t^{\min}$ by $t$, hence we save something in
the union. The optimal choice is neither $A_t^{sym}$ nor $A_t^{\min}$, but something in-between.
In the extended version [Hut03] this is illustrated for the triangular distribution
$p_t(x) = \max\{0, 1-|x-t|\}$ with $t \in T := [-\gamma,\gamma]$, where closed form solutions can be
given.
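
The Normal translation example can be checked numerically. The following is only an illustrative sketch, not taken from the paper: it assumes $T=[-1,1]$ and $\alpha=0.95$, uses the exact half-width $\kappa\approx 1.96$ rather than the rounded value 2, and the helper names (`normal_cdf`, `bisect_root`, `symmetric_half_width`) are hypothetical.

```python
import math

ALPHA = 0.95   # credibility level (assumption for this sketch)
T_MAX = 1.0    # T = [-T_MAX, T_MAX]

def normal_cdf(x, mean=0.0, sd=1.0):
    """CDF of Normal(mean, sd) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def bisect_root(f, lo, hi, iterations=200):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Half-width of the shortest (central) ALPHA-credible interval for sd = 1.
kappa = bisect_root(lambda k: math.erf(k / math.sqrt(2.0)) - ALPHA, 0.0, 10.0)

def symmetric_half_width(t):
    """Smallest a with p_t([-a, a]) >= ALPHA for x ~ Normal(t, 1)."""
    return bisect_root(
        lambda a: normal_cdf(a, t) - normal_cdf(-a, t) - ALPHA, 0.0, 20.0
    )

# Union of the shortest intervals [t - kappa, t + kappa] over t in T.
union_shortest_half_width = T_MAX + kappa

# Union of the intervals symmetric around 0: the widest one, on a grid of t.
grid = [-T_MAX + 2.0 * T_MAX * k / 200 for k in range(201)]
a_sym = max(symmetric_half_width(t) for t in grid)

print(f"union of shortest intervals:  [-{union_shortest_half_width:.3f}, {union_shortest_half_width:.3f}]")
print(f"union of symmetric intervals: [-{a_sym:.3f}, {a_sym:.3f}]")
```

Under these assumptions the symmetric choice gives a half-width of roughly 2.65, compared with $1+\kappa\approx 2.96$ for the union of the shortest intervals, in line with the claim that the union of the $A_t^{\min}$ is not minimal.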

An interesting open question is under which general conditions we can expect
$A^{\min} \subseteq \bigcup_t A_t^{\min}$. In any case, $\bigcup_t A_t$ can be used as a conservative estimate for a
robust credible set, since $p_t(\bigcup_{t'} A_{t'}) \ge p_t(A_t) \ge \alpha$ for all $t$.

A special (but important) case which falls outside the above framework is that of
one-sided credible intervals, where only $A_t$ of the form $[a,\infty)$ are considered. In this case
$A^{\min} = \bigcup_t A_t^{\min}$, i.e. $A^{\min} = [a^{\min},\infty)$ with $a^{\min} = \max\{a : p_t([a,\infty)) \ge \alpha\ \forall t\}$.
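
As a small illustration of the one-sided case (a sketch under assumptions, not part of the original derivation), take the hypothetical family $p_t = \mathrm{Normal}(t,1)$ with $t$ on a grid in $[-1,1]$. Each $A_t^{\min} = [a_t,\infty)$ has $a_t$ equal to the $(1-\alpha)$-quantile of $p_t$, and the robust threshold is the smallest of these.

```python
from statistics import NormalDist

ALPHA = 0.95
# Hypothetical family: p_t = Normal(mean=t, sd=1) for t on a grid in T = [-1, 1].
T_GRID = [-1.0 + 0.1 * k for k in range(21)]

def one_sided_threshold(t):
    """Largest a with p_t([a, inf)) >= ALPHA, i.e. the (1 - ALPHA)-quantile of p_t."""
    return NormalDist(mu=t, sigma=1.0).inv_cdf(1.0 - ALPHA)

# The union of the intervals [a_t, inf) is [min_t a_t, inf).
a_min = min(one_sided_threshold(t) for t in T_GRID)

# Every p_t assigns probability at least ALPHA to [a_min, inf).
coverage = [1.0 - NormalDist(mu=t, sigma=1.0).cdf(a_min) for t in T_GRID]
print(a_min, min(coverage))
```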

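Approximations. For complex distributions like for the mutual information we
have to approximate (15) somehow. We use the following notation for shortest
$\alpha$-credible intervals w.r.t. a univariate distribution $p_t(x)$:

$$\overline{\underline{x}}_t = [\underline{x}_t, \overline{x}_t] = [E_t[x] - \Delta\underline{x}_t,\; E_t[x] + \Delta\overline{x}_t] := \arg\min_{[a,b]:\,p_t([a,b])\ge\alpha} (b-a),$$

where $\Delta\overline{x}_t := \overline{x}_t - E_t[x]$ ($\Delta\underline{x}_t := E_t[x] - \underline{x}_t$) is the distance from the right boundary
$\overline{x}_t$ (left boundary $\underline{x}_t$) of the shortest $\alpha$-credible interval $\overline{\underline{x}}_t$ to the mean $E_t[x]$ of
distribution $p_t$. We can use $\overline{\underline{x}} = [\underline{x}, \overline{x}] := \bigcup_t \overline{\underline{x}}_t$ as a (conservative, but not shortest)
robust credible interval, since $p_t(\overline{\underline{x}}) \ge p_t(\overline{\underline{x}}_t) \ge \alpha$ for all $t$. We can upper bound $\overline{x}$
(and similarly lower bound $\underline{x}$) by

$$\overline{x} = \max_t \big(E_t[x] + \Delta\overline{x}_t\big) \le \max_t E_t[x] + \max_t \Delta\overline{x}_t = \overline{E[x]} + \overline{\Delta\overline{x}}. \tag{16}$$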

We have already intensively discussed how to compute upper and lower quantities,
particularly for the upper mean $\overline{E[x]}$ for $x \in \{F, H, I, ...\}$, but the linearization
technique introduced earlier is general enough to deal with all in $t$ differentiable
quantities, including $\Delta\overline{x}_t$. For example, for Gaussian $p_t$ with variances $\sigma_t^2$ we have
$\Delta\overline{x}_t = \kappa\sigma_t$ with $\kappa$ given by $\alpha = \mathrm{erf}(\kappa/\sqrt{2})$, where erf is the error function (e.g. $\kappa = 2$
for $\alpha \approx 95\%$). We only need to estimate $\max_t \sigma_t$.
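
The Gaussian case can be sketched in a few lines. The code below is illustrative only: `kappa_for` solves $\alpha = \mathrm{erf}(\kappa/\sqrt{2})$ by bisection, and the three-member Gaussian family (means and standard deviations) is made up for the example. It compares the exact upper boundary $\max_t(E_t[x] + \kappa\sigma_t)$ with the bound (16).

```python
import math

def kappa_for(alpha, tol=1e-12):
    """Solve alpha = erf(kappa / sqrt(2)) for kappa by bisection."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa = kappa_for(0.95)

# Hypothetical Gaussian family indexed by t: (mean E_t[x], standard deviation sigma_t).
family = {0.0: (0.30, 0.050), 0.5: (0.32, 0.048), 1.0: (0.34, 0.046)}

# Exact right boundary of the union of the shortest intervals.
upper_exact = max(mean + kappa * sd for mean, sd in family.values())

# Bound (16): maximize the mean and the half-width separately.
upper_bound = max(mean for mean, _ in family.values()) \
    + kappa * max(sd for _, sd in family.values())

print(kappa, upper_exact, upper_bound)
```

The separate maximization is slightly looser than the exact boundary whenever the largest mean and the largest $\sigma_t$ are attained at different $t$, as in this made-up family.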

For non-Gaussian distributions, exact expressions for $\Delta\overline{x}_t$ are often hard or
impossible to obtain and to deal with. Non-Gaussian distributions depending on some
sample size $n$ are usually close to Gaussian for large $n$ due to the central limit
theorem. One may simply use $\kappa\sigma_t$ in place of $\Delta\overline{x}_t$ also in this case, keeping in mind that
this could be a non-conservative approximation. More systematically, simple (and
for large $n$ good) upper bounds on $\Delta\overline{x}_t$ can often be obtained and should preferably
be used.

Further, we have seen that the variation of sample-dependent differentiable
functions (like $E_t[x] = E_t[x|n]$) w.r.t. $t \in \Delta$ is of order $n^{-1}$. Since in such cases the
standard deviation $\sigma_t \sim n^{-1/2} \sim \Delta\overline{x}_t$ is itself suppressed, the variation of $\Delta\overline{x}_t$ with
$t$ is of order $n^{-3/2}$. If we regard this as negligibly small, we may simply fix some
$t^* \in \Delta$:

$$\max_t \Delta\overline{x}_t = \kappa\sigma_{t^*} + O(n^{-3/2})$$

Since $\Delta\overline{x}_t$ is "nearly" constant, this also shows that we lose at most $O(n^{-3/2})$
precision in the bound (16) (equality holds for $\Delta\overline{x}_t$ independent of $t$). Expressions for
the variance of X, for instance, have been derived in [WW95, Hut02].
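
For the simplest quantity, a single chance $\pi_i$, the effect of fixing $t^*$ can be illustrated directly. The sketch below is an assumption-laden example rather than the paper's procedure: it uses the standard Dirichlet posterior moments (mean $u_i = (n_i + s t_i)/(n+s)$ and variance $u_i(1-u_i)/(n+s+1)$), hypothetical counts, $\kappa = 1.96$, and the Gaussian approximation, which as noted above need not be conservative.

```python
import math

def posterior_mean_sd(n_i, n, s, t_i):
    """Mean and standard deviation of pi_i under a Dirichlet posterior
    with prior parameters s * t (standard Dirichlet moment formulas)."""
    u = (n_i + s * t_i) / (n + s)
    return u, math.sqrt(u * (1.0 - u) / (n + s + 1.0))

n_i, n, s = 30, 100, 1.0   # hypothetical counts and hyper-parameter
KAPPA = 1.96               # Gaussian half-width factor for alpha of about 95%

# Variation of sigma_t over t_i in [0, 1] versus the value at a fixed t*.
t_grid = [k / 100 for k in range(101)]
sd_max = max(posterior_mean_sd(n_i, n, s, t)[1] for t in t_grid)
_, sd_star = posterior_mean_sd(n_i, n, s, 0.5)   # t* with t_i = 1/2

# Approximate robust credible interval: robust mean interval widened by kappa * sigma_{t*}.
lower = n_i / (n + s) - KAPPA * sd_star
upper = (n_i + s) / (n + s) + KAPPA * sd_star

print(sd_max - sd_star, n ** -1.5)
print(lower, upper)
```

For these counts the difference between $\max_t\sigma_t$ and $\sigma_{t^*}$ is well below $n^{-3/2}$, consistent with the order estimate above.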

9 Conclusions 

This is the first work providing a systematic approach for deriving closed form
expressions for interval estimates in the Imprecise Dirichlet Model (IDM). We
concentrated on exact and conservative robust interval ([lower, upper]) estimates for
concave functions $F = \sum_i f_i$ on simplices, like the entropy. The conservative estimates
widened the intervals by $O(n^{-2})$, where $n$ is the sample size. There is a dilemma, of
course: for large $n$ the approximations are good, whereas for small $n$ the bounds
are more interesting, so the approximations will be most useful for intermediate
$n$. More precise expressions for small $n$ would be highly interesting. We have also
indicated how to propagate robust estimates from simple functions to composite
functions, like the mutual information. We argued that a reduced IDM on
product spaces, like Bayesian nets, is more natural and should be preferred in order to
improve predictions. Although the improvement is formally only $O(n^{-2})$, the difference
may be significant in Bayes nets or for very small $n$. Finally, the basics of how to
combine robust with credible intervals have been laid out. Under certain conditions
$O(n^{-3/2})$ approximations can be derived, but the presented approximations are not
conservative. All in all, this work has shown that IDM has not only interesting
theoretical properties, but that explicit (exact/conservative/approximate) expressions
for robust (credible) intervals for various quantities can be derived. The
computational complexity of the derived bounds on $F = \sum_i f_i$ is very small, typically one or
two evaluations of $F$ or related functions, like its derivative. First applications of
these (or more precisely, very similar) results, especially the mutual information, to
robust inference of trees look promising [ZH03].